1. INTRODUCTION

Hajj is one of the largest and most widely observed time-bound religious rituals in the world.
About three million pilgrims come to Mecca every year to perform Hajj within 5-6 days, during which
they must visit several locations in Mecca. To manage the crowd, the Hajj authorities need to plan
crowd movement based on the maximum capacity of every Hajj checkpoint. In addition, they have to
foresee any dangerous condition that could lead to a crowd stampede. Even though such plans have been
devised, the Hajj authorities still face difficulties in executing them, as reflected in outcomes such as
the crowd stampede tragedy of September 2015, which resulted in the deaths of 2000 pilgrims [1]-[3].

To monitor crowd movement, video surveillance is installed around the Kaaba, Saei and
Jamaraat areas, because the Tawaf and Saei areas are the most crowded places during the rituals. However,
the surveillance cameras do not have the intelligence to inform the Hajj authorities whether the condition of
the crowd at a particular time is dangerous. Hence, it is important for a surveillance system to be able
to alert the Hajj authorities when a dangerous scenario is about to happen, so that a crowd stampede can be
prevented.

Vinayakumar et al. [4] introduced DSPNet, which extracts multi-scale features for the analysis of very
large crowds. It addresses the problem of counting people to estimate crowd density in extremely
congested scenes, for which existing methods are not suitable. The DSPNet model consists of a frontend
and a backend. The frontend is a standard deep neural network, while the backend network integrates
information at various stages. The SCA module allows the multi-scale features to be integrated and the
image representations to be enhanced.

Analyzing crowd conditions from surveillance cameras is a difficult task because of excessive occlusion,
inconsistencies in scene perception and varied crowd distributions. Even in ordinary crowded scenes,
detecting and monitoring people is difficult. It is even more difficult in the Hajj scenario, where a very
large number of people are moving at once.

Other approaches to crowd analysis have resulted in several semi-automated solutions for density
estimation and crowd counting [5]-[7]. Effective use of these semi-automated solutions, however, is
restricted by two significant limitations: (1) they cannot handle crowds of hundreds or thousands of
people, only a few tens [5]; and (2) they depend on temporal constraints in crowd videos that do not
extend to still images, which are more prevalent [8]. In addition, arbitrary camera positions and crowd
densities still lead to erroneous crowd counts from an arbitrary still image [7].

This paper proposes a moving-scene crowd analysis model. We aim to learn a mapping from
video frames to crowd analysis and then to apply that mapping to cross-scene crowd analysis within
the target scenes. One of the challenges in implementing our method is obtaining a suitable dataset.
Available crowd-image datasets include 50 static images of different crowd scenes obtained from Flickr [9].
The other commonly used dataset, UCSD, is made up of videos collected from one or more scenes [10], [5].
In this work, we obtained our dataset from YouTube; the dataset is described in detail in Section 3.

We propose a complete system for moving-scene crowd video analysis based on a convolutional
neural network (CNN). We present a method that leverages a CNN model particularly suited to Hajj
applications. This work also introduces a system for counting the crowd and then estimating its density.
The CNN is trained with an improved technique for crowd scenes to learn targets, crowd density and
crowd counts [11], [12]. Our CNN learns crowd-specific features that give better performance
than handcrafted features. The paper is organized as follows: Section 2 describes the details of the
proposed system. Section 3 presents the Hajj-Crowd dataset. Section 4 presents the experimental setup
and result analysis. Finally, Section 5 concludes the paper and presents future work.


2. RESEARCH METHOD 

The proposed model is built on a robust Hajj crowd counting and detection design. It predicts
precisely localized boxes on people's heads in Hajj crowd images. Although finding the head size of each
person may seem to be a multi-stage process, we formulate it as a one-stage, end-to-end system. Figure 1
shows the proposed CNN architecture, which is based on three technical components. The first is frame
extraction.





Figure 1. Proposed CNN architecture 


For frame extraction, we collected a few Hajj crowd videos and then performed 30-frame
extraction on those videos. The second component is spatial feature extraction, in which the feature
extractor operates at multiple resolutions and produces CNN-based prediction maps. These feature maps are
forwarded to a set of CNN-based multi-scale feedback reasoning networks (MSFRN), where information is
fused across the scales and predictions are made as boxes. The final crowd density output is produced by
non-maximum suppression (NMS), which combines the valid detections from the different resolutions. For
model training, this final stage is replaced by the grid winner-take-all (GWTA) module, and the loss is
back-propagated to train the network. The following subsections detail all the functional components of
the CNN, including the training algorithm.


2.1. CNN layer architecture 

All current CNN object detectors run on top of a deep backbone feature-extractor network,
and the quality of its features may directly affect detection accuracy. CNN-based networks are
employed in crowd counting in various ways and offer near state-of-the-art performance [13]. Following
this pattern, multiple CNN convolution layers are employed to improve crowd feature extraction. The
backbone network is formed by the first five CNN convolution blocks, initialized with ImageNet-trained
weights [14]. The network takes a fixed-size RGB crowd image (224 x 224) as input, and max-pooling
down-samples the data at each block. The network branches at every block except the final one, and each
branch duplicates the subsequent block. These copied blocks are used to build feature maps at resolutions
of 1/2, 1/4, 1/8 and 1/16 of the input. This contrasts with traditional hypercolumn features and helps to
specialize each scale branch by sharing low-level features in a conflict-free manner. The low-level
features at one-half of the spatial scale can, in principle, capture and resolve very dense crowds [15].
The lower-resolution branches, on the other hand, have progressively larger receptive fields, which suit
comparatively sparse crowds. Scale diversity is thus handled by providing branches at several scales,
each specializing in a different crowd type.
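The down-sampling described above can be checked with a short calculation. The sketch below assumes that every convolution block ends with a 2 x 2 max-pooling layer that halves the side of the feature map; the function name and its parameters are illustrative and do not come from the original implementation.

```python
def branch_resolutions(input_size=224, num_blocks=5):
    """Return (block index, scale relative to the input, feature-map side) per block.

    Assumption: each block ends with a 2 x 2 max-pooling layer that halves the side.
    """
    resolutions = []
    side = input_size
    for block in range(1, num_blocks + 1):
        side //= 2  # max-pooling halves the spatial side
        resolutions.append((block, side / input_size, side))
    return resolutions


# Every block except the final one feeds a scale branch.
for block, scale, side in branch_resolutions()[:-1]:
    print(f"block {block}: scale {scale}, feature map {side} x {side}")
```

Under this assumption, a 224 x 224 input gives branch feature maps with sides of 112, 56, 28 and 14 pixels.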


2.2. Box classification 

We adopt a per-pixel classification paradigm for box sizing. Essentially, a set of bounding boxes
with predefined sizes is given, and the model classifies each head as belonging to one of the boxes or to
the background. This contrasts with the anchor-box paradigm commonly used in detectors, where the box
parameters are regressed [16]. Each scale branch of the model generates a set of maps,
$\{D_b^s\}_{b=0}^{n_B}$, giving the per-pixel confidence for each box class. Ground-truth head sizes are
needed for model training, but they are unavailable and hard to annotate in conventional large crowd
datasets. In this work, we develop a scheme for approximating head sizes. We rely on the point
annotations available with crowd datasets to generate the ground truth. Each annotation point marks the
position of a person's head, roughly at its center, although the position may vary greatly in sparse
crowds. Besides identifying each individual in the crowd, the annotation points also provide some
information about scale. Under the assumption of uniform crowd density, the gap between two adjacent
individuals indicates the bounding-box size of their heads. Note that only square boxes are considered.
In simple terms, the head size at a given point is taken as the distance to its nearest neighbor. While
this is reasonable for medium to dense crowds, it may give incorrect box sizes for sparsely populated
crowds, where the nearest neighbor is far away. Empirically, however, it produces fairly good head sizes
across a wide range of densities.
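The nearest-neighbor rule can be illustrated with a minimal sketch. The function name and the sample coordinates are hypothetical, and a brute-force search is used for clarity rather than speed; a point with no neighbor in the patch is given an infinite size here.

```python
import math


def pseudo_head_sizes(points):
    """Map each annotated head point to the distance to its nearest neighbor."""
    sizes = {}
    for i, (x, y) in enumerate(points):
        distances = [
            math.hypot(x - u, y - v)
            for j, (u, v) in enumerate(points)
            if j != i
        ]
        # A lone annotation has no neighbor, so its size is unbounded.
        sizes[(x, y)] = min(distances) if distances else math.inf
    return sizes


# Hypothetical annotations: two close heads and one isolated head.
heads = [(10, 10), (13, 14), (40, 40)]
for point, size in pseudo_head_sizes(heads).items():
    print(point, round(size, 2))
```

The isolated head at (40, 40) receives a much larger size than the other two, which shows why the rule can overestimate box sizes in sparse regions.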

A pseudo ground truth is derived as follows. Let $P$ be the set of annotated $(x, y)$ positions of all
individuals in the given image patch. For every point $(x, y)$ in $P$, the box size is defined as

$$\beta[x, y] = \min_{(x', y') \in P,\ (x', y') \neq (x, y)} \sqrt{(x - x')^2 + (y - y')^2}, \tag{1}$$

that is, the distance to the nearest neighbor. If there is only one individual in the image patch,
the box size is taken as $\infty$. Next, the values of $\beta[x, y]$ are discretized into predefined
bins that determine the box sizes. Supposing that the box sizes $\{\beta_b^s\}_{b=1}^{n_B}$ are defined
for scale $s$, each annotated position $(x, y)$ receives the classification label of the box whose size
bin contains $\beta[x, y]$.

A common principle is followed in selecting the box sizes $\beta_b^s$ for each scale. The first box size
($b = 1$) at the highest-resolution scale ($s = n_S - 1$) is fixed to one, which improves the ability to
resolve highly dense crowds. For the remaining boxes at this scale, progressively larger sizes are chosen
with a constant increment. The change is fine-grained in the higher-resolution branches and becomes
steadily coarser in the lower-resolution ones. Precisely, with $\gamma^s$ denoting the size increment for
scale $s$, the box sizes are set as


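$$\beta_b^s = \begin{cases} \beta_{n_B}^{s+1} + b\,\gamma^s, & \text{if } s < n_S - 1 \\ 1 + (b - 1)\,\gamma^s, & \text{otherwise} \end{cases} \tag{3}$$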


The typical size increments for the different scales are $\gamma = \{4, 2, 1, 1\}$. Observe
that the high-resolution branches (1/2 and 1/4) have boxes of finer sizes than the low-resolution
ones (1/8 and 1/16), where coarse resolution is adequate (as depicted in
Figure 2) [17].
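As a rough illustration of how the increments translate into box sizes, the sketch below enumerates the sizes for each scale. It rests on two assumptions that are not confirmed by the text: that each coarser scale continues from the largest box of the next finer scale, and that three boxes are used per scale. The function and parameter names are hypothetical.

```python
def box_sizes(gammas=(4, 2, 1, 1), boxes_per_scale=3):
    """Return {scale s: [box sizes]}, where s = len(gammas) - 1 is the finest scale.

    Assumptions: gammas[s] is the size increment of scale s, and a coarser scale
    starts from the largest box size of the next finer scale.
    """
    n_scales = len(gammas)
    sizes = {}
    for s in range(n_scales - 1, -1, -1):  # finest scale first
        gamma = gammas[s]
        if s == n_scales - 1:
            # The first box at the finest scale is fixed to one.
            sizes[s] = [1 + (b - 1) * gamma for b in range(1, boxes_per_scale + 1)]
        else:
            largest_finer = sizes[s + 1][-1]
            sizes[s] = [largest_finer + b * gamma for b in range(1, boxes_per_scale + 1)]
    return sizes


for scale, values in sorted(box_sizes().items(), reverse=True):
    print(f"scale {scale}: {values}")
```

With these assumptions the finest scale uses boxes of 1, 2 and 3 pixels and the coarsest uses 16, 20 and 24 pixels, which matches the description of fine steps at high resolution and coarse steps at low resolution.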





Figure 2. Pseudo ground-truth boxes produced using the formula presented in (3)


2.3. Grid winner-take-all training 
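Loss calculation: the CNN is trained by back-propagating a per-pixel cross-entropy loss. The loss for
each pixel is defined as

$$l\left(\{d_i\}_{i=0}^{n_B}, \{b_i\}_{i=0}^{n_B}, \{\alpha_i\}_{i=0}^{n_B}\right) = -\sum_{i=0}^{n_B} \alpha_i\, b_i \log d_i, \tag{4}$$

where $\{d_i\}_{i=0}^{n_B}$ is the set of $n_B + 1$ predicted probabilities for the predefined box classes,
$\{b_i\}_{i=0}^{n_B}$ is the corresponding ground-truth label and $\{\alpha_i\}_{i=0}^{n_B}$ are the
weights applied to each class. The loss for the predictions of a scale branch is then

$$L\left(\{D_b^s\}, \{B_b^s\}, \{\alpha_b^s\}\right) = \frac{1}{w^s h^s} \sum_{x, y} l\left(\{D_b^s[x, y]\}_{b=0}^{n_B}, \{B_b^s[x, y]\}_{b=0}^{n_B}, \{\alpha_b^s\}_{b=0}^{n_B}\right),$$

where the inputs are the set of predictions $\{D_b^s\}_{b=0}^{n_B}$ and the pseudo ground truths
$\{B_b^s\}_{b=0}^{n_B}$ (the limits are dropped where convenient). Note that $(w^s, h^s)$ is the spatial
size of these maps. The cross-entropy losses of all scale branches are then summed:

$$L_{comb} = \sum_{s=0}^{n_S - 1} L\left(\{D_b^s\}, \{B_b^s\}, \{\alpha_b^s\}\right). \tag{5}$$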
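The loss computation can be sketched in plain Python as follows. This is an illustrative reading of the weighted per-pixel cross-entropy, averaged over each branch map and summed over branches; it is not the training code, and the nested-list data layout and function names are assumptions.

```python
import math


def pixel_loss(d, b, alpha):
    """Weighted cross-entropy for one pixel: -sum_i alpha_i * b_i * log(d_i)."""
    # Terms with a zero label contribute nothing, so they are skipped,
    # which also avoids evaluating log(0) for classes that are not the target.
    return -sum(a * t * math.log(p) for p, t, a in zip(d, b, alpha) if t)


def branch_loss(pred_map, label_map, alpha):
    """Average the per-pixel loss over an h x w map of per-class lists."""
    h, w = len(pred_map), len(pred_map[0])
    total = sum(
        pixel_loss(pred_map[y][x], label_map[y][x], alpha)
        for y in range(h)
        for x in range(w)
    )
    return total / (w * h)


def combined_loss(pred_maps, label_maps, alphas):
    """Sum the branch losses over all scale branches."""
    return sum(
        branch_loss(pred, lab, alpha)
        for pred, lab, alpha in zip(pred_maps, label_maps, alphas)
    )


# Hypothetical 1 x 2 map with two classes (background and one box class).
pred = [[[0.5, 0.5], [0.25, 0.75]]]
label = [[[1, 0], [0, 1]]]
print(branch_loss(pred, label, alpha=[1.0, 1.0]))
```

In practice a framework loss with per-class weights would be used; the explicit loops are only meant to make each term of the sum visible.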


2.4. Count of heads 

To test the model, the prediction fusion process illustrated in Figure 1 is used instead of GWTA.
All branches evaluate the input image and produce multi-resolution predictions. The box positions are
obtained from these prediction maps and linearly scaled to the input resolution. NMS is then applied to
remove boxes that overlap by more than a threshold. The boxes remaining after NMS form the model's final
prediction and are enumerated to produce the crowd count. The NMS threshold is set to 0.3 (30 percent
area overlap) to facilitate intermediate evaluations during training. For the best model after training,
however, a threshold search is run to minimize the counting error over the validation set. Figure 3
shows a prediction made by the CNN: Figure 3 (a) is the original image, Figure 3 (b) is the CNN
prediction with 3 x 3 boxes and Figure 3 (c) depicts the map prediction.
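The counting step can be sketched with a greedy NMS. The example assumes that overlap is measured as intersection over union and that each detection carries a confidence score; both are assumptions for illustration, as are the function names and sample boxes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, threshold=0.3):
    """Greedy NMS over (score, box) pairs; returns the detections that are kept."""
    kept = []
    for score, box in sorted(detections, key=lambda det: det[0], reverse=True):
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(box, kept_box) <= threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Hypothetical detections pooled from several resolutions.
detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # overlaps the first box heavily
    (0.7, (30, 30, 36, 36)),
]
crowd_count = len(nms(detections))
print(crowd_count)
```

The crowd count is simply the number of boxes that survive suppression, so the threshold directly trades duplicate detections against missed heads in dense regions.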







Figure 3. Prediction made by the CNN on images from the Hajj-crowd dataset, (a) original image,
(b) CNN prediction, (c) map prediction


3. HAJJ-CROWD DATASET
This section discusses the proposed HAJJ-Crowd dataset from three perspectives: data capture and
specification definition, method of annotation and experiment.


3.1. Data capture and specification definition 

The HAJJ-Crowd data were collected from the YouTube live telecast of Hajj 2019 in Mecca.
In some populated areas surrounding the Kaaba (Tawaf area), 1000 images and 10 video sequences were
recorded, containing typical crowd scenes such as touching the Black Stone in the Kaaba area and
throwing stones in the Mina area. In addition, we collected 500 samples from Google using
typical crowd-related query keywords. In total, 1500 raw images were obtained by the two methods
described above. Figure 4 shows examples from the proposed Hajj-crowd dataset.







Figure 4. Hajj-crowd dataset 


3.2. Method of annotation (tools) 

As the annotation tool, we used Python and OpenCV for easy annotation of head points in
the crowd images. The tool supports two label forms, namely point and bounding box. During annotation,
every image can be flexibly zoomed in or out to annotate heads at different scales, and it is divided
into at most 3 x 3 small patches, enabling annotators to mark heads at five sizes: $2^x$
($x = 0, 1, 2, 3, 4$) times the original image size.


4. EXPERIMENTAL SETUP AND RESULT ANALYSIS
4.1. Experimental setup 

The CNN model is trained to minimize the loss function using the back-propagation algorithm. First, we
collected all the images at a resolution of 1280 x 720, and the labels were generated at this size.
Second, we used a deep learning algorithm to enhance the CNN and obtain the best results. Training and
analysis were run on an NVIDIA GeForce GTX 1660 Ti GPU using the PyTorch deep learning framework on
Ubuntu 18.04 LTS. Finally, we used Python 3 with packages such as OpenCV (cv2), NumPy, SciPy,
Matplotlib and torchvision, among others.


4.2. Experiment 

The 1,500 images of the HAJJ-Crowd dataset are divided into three parts: training, validation and
testing. Two metrics are used to measure counting accuracy, namely the mean absolute error (MAE) and
the mean squared error (MSE), defined as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{6}$$

$$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|^2} \tag{7}$$


where $N$ is the number of samples in the test set, $y_i$ is the count label and $\hat{y}_i$ is the
estimated count for sample $i$. Additionally, the model is examined from different perspectives. Each
image is assigned attribute labels according to its annotated count and its image quality. The former
has five groups by number of people: 0, (0, 1000), (1000, 2000), (2000, 3000) and over 3000. In the
experiments, MAE and MSE are computed for each class of a given attribute over the corresponding
samples. For instance, for the luminance attribute, the average MAE and MSE are computed for the two
categories, which indicates the sensitivity of the counting model to luminance variation.
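Both metrics are straightforward to compute. The sketch below follows the crowd-counting convention in which the reported MSE is the square root of the mean squared count error; the count values are made up for illustration and are not results from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between labelled and estimated counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Root of the mean squared count error, reported as MSE in crowd counting."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical labelled and estimated counts for three test images.
labels = [100, 200, 300]
estimates = [110, 190, 330]
print(round(mae(labels, estimates), 2), round(mse(labels, estimates), 2))
```

Because the squared term penalizes large misses more heavily, the MSE computed this way is never smaller than the MAE on the same samples.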


4.3. Result analysis 

Figure 5 (a) and Figure 5 (b) show that there is no considerable change in pixel loss from 0 to 10
epochs, whereas from 10 to 20 epochs the pixel loss is 10. From 20 to 30 epochs and on to 40-52 epochs,
the pixel loss continues to increase, reaching 16.0 at 52 epochs. The valid loss is obtained by
subtracting the test loss from the training loss. In the valid loss, the pixel loss is 17 at 40 epochs
and 14 at 52 epochs. At the same epochs, we calculated the test MAE using (6). The test MAE and valid
MAE are shown in Figure 5 (c) and Figure 5 (d). For the test MAE, the error is above 600 at epoch 0 and
decreases as the number of epochs increases, falling to 240.0 after 52 epochs. For the valid MAE, the
error is above 425 at epoch 0 and likewise decreases as the number of epochs increases, falling to
255.0 after 52 epochs. Figure 5 shows the graphical representation of the result analysis.


4.4. Comparison of the proposed model with other state-of-the-art methods

The Hajj-crowd dataset is a large-scale crowd counting and density dataset. It includes 200 training
images and 165 test images, all at the same 1280 x 720 resolution. Results reported for mainstream
techniques (Idrees et al. [18], Yu et al. [19], [20], FCN [21], Cascaded MTL [22], MCNN [23] and so on)
on the UCF_CC_50 dataset are used for comparison. Among these, Idrees et al. attain an MAE of 419.5
and an MSE of 541.6. Our method outperforms the state-of-the-art methods in the context of the new
HAJJ-Crowd dataset, attaining an MAE of 240.0 (a 179.5-point improvement) and an MSE of 260.5
(a 281.1-point improvement). Table 1 lists the estimation errors.


Table 1. Estimation of errors on the UCF_CC_50 dataset


Method               Dataset                          MAE     MSE
Idrees et al. [18]   UCF_CC_50                        419.5   541.6
Yu et al. [19]       UCF_CC_50                        467.0   498.5
MCNN [23]            UCF_CC_50                        377.6   509.1
FCN [24]             UCF_CC_50                        338.6   424.5
Cascaded-MTL [25]    UCF_CC_50                        322.8   341.4
Ours                 HAJJ-Crowd (our own dataset)     240.0   260.5


Figure 5. Result analysis, (a) test loss, (b) valid loss, (c) test MAE, (d) valid MAE


5. CONCLUSION 

This research presents a new crowd density prediction model based on a convolutional neural network.
The model uses a multi-column structure with top-down feedback processing to address the issues that
arise in massive crowds. Unlike the models discussed above, the proposed model can detect moving
crowds. The experimental results show better performance than the other methods. In particular, the
proposed crowd analysis considerably improves counting performance for highly congested crowd scenes.
Future research will consider better crowd detection and more accurate head-size annotation.